DSC540 Final Project¶

Online Gaming Anxiety Analysis using Classification¶

By Vatsal Parikh¶

Data: https://www.kaggle.com/datasets/divyansh22/online-gaming-anxiety-data?resource=download

Project Description¶

In this project I will analyze Anxiety of Gamers. I have used multiple visualization and have tried to classify if gamer has concerning anxiety level or not on basis of their nature.

I used the combination of GAD (General Anxiety Disorder), SWL (Satisfaction with Life) and SPIN (Social Phobia Inventory) scores to build the target variable.

Importing Libraries¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import plotly.figure_factory as ff
import plotly.express as px
from kmodes.kprototypes import KPrototypes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
import warnings

warnings.filterwarnings('ignore')

Reading Dataset¶

In [2]:
df = pd.read_csv("GamingStudy_data.csv", encoding= 'unicode_escape')
In [3]:
df.head()
Out[3]:
S. No. Timestamp GAD1 GAD2 GAD3 GAD4 GAD5 GAD6 GAD7 GADE ... Birthplace Residence Reference Playstyle accept GAD_T SWL_T SPIN_T Residence_ISO3 Birthplace_ISO3
0 1 42052.00437 0 0 0 0 1 0 0 Not difficult at all ... USA USA Reddit Singleplayer Accept 1 23 5.0 USA USA
1 2 42052.00680 1 2 2 2 0 1 0 Somewhat difficult ... USA USA Reddit Multiplayer - online - with strangers Accept 8 16 33.0 USA USA
2 3 42052.03860 0 2 2 0 0 3 1 Not difficult at all ... Germany Germany Reddit Singleplayer Accept 8 17 31.0 DEU DEU
3 4 42052.06804 0 0 0 0 0 0 0 Not difficult at all ... USA USA Reddit Multiplayer - online - with online acquaintanc... Accept 0 17 11.0 USA USA
4 5 42052.08948 2 1 2 2 2 3 2 Very difficult ... USA South Korea Reddit Multiplayer - online - with strangers Accept 14 14 13.0 KOR USA

5 rows × 55 columns

In [4]:
df.columns
Out[4]:
Index(['S. No.', 'Timestamp', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5', 'GAD6',
       'GAD7', 'GADE', 'SWL1', 'SWL2', 'SWL3', 'SWL4', 'SWL5', 'Game',
       'Platform', 'Hours', 'earnings', 'whyplay', 'League', 'highestleague',
       'streams', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 'SPIN6',
       'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13',
       'SPIN14', 'SPIN15', 'SPIN16', 'SPIN17', 'Narcissism', 'Gender', 'Age',
       'Work', 'Degree', 'Birthplace', 'Residence', 'Reference', 'Playstyle',
       'accept', 'GAD_T', 'SWL_T', 'SPIN_T', 'Residence_ISO3',
       'Birthplace_ISO3'],
      dtype='object')

Removing columns which won't be useful for analysis¶

In [5]:
df = df.drop(columns=['S. No.', 'Timestamp', 'League', 'highestleague', 'Narcissism', 'Birthplace', 'Residence',
                 'accept', 'Birthplace_ISO3', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5',
                 'GAD6', 'GAD7', 'SWL1', 'SWL2', 'SWL3', 'SWL4', 'SWL5', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4',
                'SPIN5', 'SPIN6', 'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13',
                'SPIN14', 'SPIN15', 'SPIN16', 'SPIN17'])
In [6]:
df.head()
Out[6]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle GAD_T SWL_T SPIN_T Residence_ISO3
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 I play for fun having fun 0.0 Male 25 Unemployed / between jobs Bachelor (or equivalent) Reddit Singleplayer 1 23 5.0 USA
1 Somewhat difficult Other PC 8.0 I play for fun having fun 2.0 Male 41 Unemployed / between jobs Bachelor (or equivalent) Reddit Multiplayer - online - with strangers 8 16 33.0 USA
2 Not difficult at all Other PC 0.0 I play for fun having fun 0.0 Female 32 Employed Bachelor (or equivalent) Reddit Singleplayer 8 17 31.0 DEU
3 Not difficult at all Other PC 20.0 I play for fun improving 5.0 Male 28 Employed Bachelor (or equivalent) Reddit Multiplayer - online - with online acquaintanc... 0 17 11.0 USA
4 Very difficult Other Console (PS, Xbox, ...) 20.0 I play for fun having fun 1.0 Male 19 Employed High school diploma (or equivalent) Reddit Multiplayer - online - with strangers 14 14 13.0 KOR

Preprocessing¶

Data Cleaning¶

I manually filtered each column here after going through all possible combinations.

In [7]:
df["Platform"] = df["Platform"].map(lambda x: x.rstrip('(PS, Xbox, ...)') if "console" in x else x)
In [8]:
df["Playstyle"] = df["Playstyle"].map(lambda x: "Multiplayer" if any([y in x.lower() for y in ["multiplayer", "online", "friend", "stranger", "internet", "match"]]) 
                    else "SinglePlayer" if any([y in x.lower() for y in ["single", "alone", "solo", "one"]]) 
                    else "Both" if any([y in x.lower() for y in ["all", "both", "everything", "mix", "5"]]) else "Other")
In [9]:
df["earnings"] = df["earnings"].map(lambda x: 1 if any([y in x.lower() for y in ["earn", "money", "both", "pay", "paid", "living", "$", "pro", "career", "job", "tourn"]]) else 0)
In [10]:
df["whyplay"] = df["whyplay"].map(lambda x: "all" if all([y in x.lower() for y in ["fun", "improv", "relax", "win"]]) or 
                  any([y in x.lower() for y in ["all", "a b c", "4", "any", "everything"]])
                  else "improve and relax" if all([y in x.lower() for y in ["improv", "relax"]])
                  else "improve and win" if all([y in x.lower() for y in ["improv", "win"]]) or 
                  any([y in x.lower() for y in ["goal"]])
                  else "relax and win" if all([y in x.lower() for y in ["relax", "win"]])
                  else "fun and relax" if all([y in x.lower() for y in ["fun", "relax"]]) or 
                  any([y in x.lower() for y in ["friend", "passing"]])
                  else "fun and improve" if all([y in x.lower() for y in ["fun", "improv"]])
                  else "fun and winning" if all([y in x.lower() for y in ["fun", "winning"]]) or 
                  any([y in x.lower() for y in ["loot"]])
                  else "fun" if "fun" in x or 
                  any([y in x.lower() for y in ["bored", "socializ"]]) 
                  else "improve" if "improv" in x
                  else "relax" if any([y in x.lower() for y in ["relax", "stress", "forget", "depress", "distract", "wast", "escap", "problem"]])
                  else "win" if "win" in x else "Other")
In [11]:
df["Degree"] = df["Degree"].map(lambda x: "Bachelor" if "Bachelor" in x
                  else "High School" if "High School" in x 
                  else "Doctorate" if all([y in x for y in ["Ph.D.", "Psy. D.", "MD"]])
                  else "Master" if "Master" in x else "Other")
In [12]:
df.head()
Out[12]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle GAD_T SWL_T SPIN_T Residence_ISO3
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 1 23 5.0 USA
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 8 16 33.0 USA
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 8 17 31.0 DEU
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 0 17 11.0 USA
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 14 14 13.0 KOR
In [13]:
df.describe()
Out[13]:
Hours earnings streams Age GAD_T SWL_T SPIN_T
count 13434.000000 13464.000000 13364.000000 13464.000000 13464.000000 13464.000000 12814.000000
mean 22.247357 0.087938 11.233538 20.930407 5.211973 19.788844 19.848525
std 70.284502 0.283216 78.549209 3.300897 4.713267 7.229243 13.467493
min 0.000000 0.000000 0.000000 18.000000 0.000000 5.000000 0.000000
25% 12.000000 0.000000 4.000000 18.000000 2.000000 14.000000 9.000000
50% 20.000000 0.000000 8.000000 20.000000 4.000000 20.000000 17.000000
75% 28.000000 0.000000 15.000000 22.000000 8.000000 26.000000 28.000000
max 8000.000000 1.000000 9001.000000 63.000000 21.000000 35.000000 68.000000

Handling Null Values¶

In [14]:
df.isna().sum()
Out[14]:
GADE              649
Game                0
Platform            0
Hours              30
earnings            0
whyplay             0
streams           100
Gender              0
Age                 0
Work               38
Degree              0
Reference          15
Playstyle           0
GAD_T               0
SWL_T               0
SPIN_T            650
Residence_ISO3    110
dtype: int64
In [15]:
df['SPIN_T'].fillna(value=df['SPIN_T'].mean(), inplace=True)
In [16]:
df.dropna(subset=['GADE'], inplace=True)
df.dropna(subset=['Hours'], inplace=True)
df.dropna(subset=['streams'], inplace=True)
df.dropna(subset=['Work'], inplace=True)
df.dropna(subset=['Residence_ISO3'], inplace=True)
df.dropna(subset=['Reference'], inplace=True)
In [17]:
df.isna().sum()
Out[17]:
GADE              0
Game              0
Platform          0
Hours             0
earnings          0
whyplay           0
streams           0
Gender            0
Age               0
Work              0
Degree            0
Reference         0
Playstyle         0
GAD_T             0
SWL_T             0
SPIN_T            0
Residence_ISO3    0
dtype: int64

Handling Outliers¶

In [18]:
df.drop(df[df.Hours >= 120].index, inplace=True)
In [19]:
df.drop(df[df.streams >= 120].index, inplace=True)

Label Encoding¶

In [20]:
df['Residence_ISO3']= LabelEncoder().fit_transform(df['Residence_ISO3'])
In [21]:
df.head()
Out[21]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle GAD_T SWL_T SPIN_T Residence_ISO3
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 1 23 5.0 102
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 8 16 33.0 102
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 8 17 31.0 23
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 0 17 11.0 102
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 14 14 13.0 56

Manual Encoding based on score cut-off¶

GAD (General Anxiety Disorder)¶

In [22]:
conditions = [
    (df['GAD_T'] <= 4),
    (df['GAD_T'] >= 5) & (df['GAD_T'] <= 9),
    (df['GAD_T'] >= 10) & (df['GAD_T'] <= 14),
    (df['GAD_T'] >= 15)
    ]
In [23]:
values = ['minimal', 'mild', 'moderate', 'severe']
In [24]:
df['GAD'] = np.select(conditions, values)
df = df.drop(["GAD_T"], axis=1)
df.head()
Out[24]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle SWL_T SPIN_T Residence_ISO3 GAD
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 23 5.0 102 minimal
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 16 33.0 102 mild
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 17 31.0 23 mild
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 17 11.0 102 minimal
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 14 13.0 56 moderate
In [25]:
conditions = [
    (df['SWL_T'] <= 9),
    (df['SWL_T'] >= 10) & (df['SWL_T'] <= 14),
    (df['SWL_T'] >= 15) & (df['SWL_T'] <= 19),
    (df['SWL_T'] == 20),
    (df['SWL_T'] >= 21) & (df['SWL_T'] <= 25),
    (df['SWL_T'] >= 26) & (df['SWL_T'] <= 30),
    (df['SWL_T'] >= 31) & (df['SWL_T'] <= 35)
]
In [26]:
values = ['extremely dissatisfied', 'dissatisfied', 'slightly dissatisfied', 'neutral', 'slightly satisfied', 'satisfied', 'extremely satisfied']

SWL (Satisfaction with Life)¶

In [27]:
df['SWL'] = np.select(conditions, values)
df = df.drop(["SWL_T"], axis=1)
df.head()
Out[27]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle SPIN_T Residence_ISO3 GAD SWL
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 5.0 102 minimal slightly satisfied
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 33.0 102 mild slightly dissatisfied
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 31.0 23 mild slightly dissatisfied
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 11.0 102 minimal slightly dissatisfied
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 13.0 56 moderate dissatisfied
In [28]:
conditions = [
    (df['SPIN_T'] <= 20),
    (df['SPIN_T'] >= 21) & (df['SPIN_T'] <= 30),
    (df['SPIN_T'] >= 31) & (df['SPIN_T'] <= 40),
    (df['SPIN_T'] >= 41) & (df['SPIN_T'] <= 50),
    (df['SPIN_T'] >= 51)
    ]
In [29]:
values = ['minimal', 'mild', 'moderate', 'severe', 'extreme']

SPIN (Social Phobia Inventory)¶

In [30]:
df['SPIN'] = np.select(conditions, values)
df = df.drop(["SPIN_T"], axis=1)
df.head()
Out[30]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle Residence_ISO3 GAD SWL SPIN
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 102 minimal slightly satisfied minimal
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 102 mild slightly dissatisfied moderate
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 23 mild slightly dissatisfied moderate
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 102 minimal slightly dissatisfied minimal
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 56 moderate dissatisfied minimal

Correlation Matrix¶

In [31]:
corr = df.corr()
z = np.array(corr)

fig = ff.create_annotated_heatmap(z, x = list(corr.columns), y = list(corr.index),
                                  annotation_text = np.around(z, decimals=2),
                                  hoverinfo='z')
fig.show()

Histogram of Age Column¶

In [32]:
fig = px.histogram(df, x="Age")
fig.show()

Side by Side Bar Graph of Platform by Gender¶

In [33]:
fig = px.histogram(df, y="Platform", color="Gender", barmode="group", log_x=True)
fig.show()

Side by Side Bar Graph of Playstyle by Gender¶

In [34]:
fig = px.histogram(df, y="Playstyle", color="Gender", barmode="group", log_x=True)
fig.show()

Violin Plot of Work vs Hours¶

In [35]:
fig = px.violin(df, y="Hours", x="Work",color="Work", hover_data=[df.Hours])
fig.update_layout(showlegend=False)
fig.show()

Pie Chart of GAD (General Anxiety Disorder)¶

In [36]:
fig = px.pie(df, values=df.index, names='GAD')
fig.show()

Pie Chart of SWL (Satisfaction with Life)¶

In [37]:
fig = px.pie(df, values=df.index, names='SWL')
fig.show()

Pie Chart of SPIN (Social Phobia Inventory)¶

In [38]:
fig = px.pie(df, values=df.index, names='SPIN')
fig.show()

I started by defining my own label column based on cut-off values, but I was not able to separate the data properly and I was always underfitting. So, I tried using clustering here to find patterns.

Clustering¶

k-Prototype¶

I am using k-Prototype here as k-Means does not work with Categorical values and k-Mode only works with Categorical values. The k-Prototype algorithm is an extension to the k-Modes algorithm that combines the k-modes and k-means algorithms and is able to cluster mixed numerical and categorical variables.

In [39]:
categorical_features_idx = [0, 1, 2, 4, 5, 7, 9, 10, 11, 12, 13, 14, 15, 16]
In [40]:
mark_array=df.values

K-Prototype Clustering¶

In [41]:
kproto = KPrototypes(n_clusters=4, verbose=2, max_iter=20).fit(mark_array, categorical=categorical_features_idx)
Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/20, moves: 4261, ncost: 1782368.1216841005
Run: 1, iteration: 2/20, moves: 2786, ncost: 1645058.848518113
Run: 1, iteration: 3/20, moves: 1769, ncost: 1564083.6645404622
Run: 1, iteration: 4/20, moves: 899, ncost: 1541589.3414343747
Run: 1, iteration: 5/20, moves: 159, ncost: 1539578.6743116563
Run: 1, iteration: 6/20, moves: 12, ncost: 1539569.4896820632
Run: 1, iteration: 7/20, moves: 1, ncost: 1539569.4317001232
Run: 1, iteration: 8/20, moves: 0, ncost: 1539569.4317001232
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 2, iteration: 1/20, moves: 1043, ncost: 1751398.762116424
Run: 2, iteration: 2/20, moves: 1187, ncost: 1585792.9930233965
Run: 2, iteration: 3/20, moves: 677, ncost: 1553942.7185046503
Run: 2, iteration: 4/20, moves: 193, ncost: 1550902.3147146506
Run: 2, iteration: 5/20, moves: 35, ncost: 1550864.7742551342
Run: 2, iteration: 6/20, moves: 5, ncost: 1550863.7879512229
Run: 2, iteration: 7/20, moves: 0, ncost: 1550863.7879512229
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 3, iteration: 1/20, moves: 2423, ncost: 1558492.0816207244
Run: 3, iteration: 2/20, moves: 1057, ncost: 1529640.8407042166
Run: 3, iteration: 3/20, moves: 114, ncost: 1528484.1825467008
Run: 3, iteration: 4/20, moves: 30, ncost: 1528403.7660156093
Run: 3, iteration: 5/20, moves: 5, ncost: 1528401.6038185547
Run: 3, iteration: 6/20, moves: 0, ncost: 1528401.6038185547
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 4, iteration: 1/20, moves: 3207, ncost: 1663611.4336055913
Run: 4, iteration: 2/20, moves: 1171, ncost: 1558765.5763644057
Run: 4, iteration: 3/20, moves: 552, ncost: 1539304.1384839974
Run: 4, iteration: 4/20, moves: 341, ncost: 1530093.5799751224
Run: 4, iteration: 5/20, moves: 98, ncost: 1528550.7755063504
Run: 4, iteration: 6/20, moves: 37, ncost: 1528403.7660156093
Run: 4, iteration: 7/20, moves: 5, ncost: 1528401.6038185547
Run: 4, iteration: 8/20, moves: 0, ncost: 1528401.6038185547
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 5, iteration: 1/20, moves: 3647, ncost: 1802212.9587056346
Run: 5, iteration: 2/20, moves: 3340, ncost: 1583650.6075399693
Run: 5, iteration: 3/20, moves: 1204, ncost: 1543359.7662458392
Run: 5, iteration: 4/20, moves: 286, ncost: 1539676.4314685452
Run: 5, iteration: 5/20, moves: 39, ncost: 1539582.17124867
Run: 5, iteration: 6/20, moves: 12, ncost: 1539569.3605938305
Run: 5, iteration: 7/20, moves: 1, ncost: 1539569.3259381542
Run: 5, iteration: 8/20, moves: 0, ncost: 1539569.3259381542
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 6, iteration: 1/20, moves: 4281, ncost: 1641081.242197436
Run: 6, iteration: 2/20, moves: 1841, ncost: 1554717.4498859472
Run: 6, iteration: 3/20, moves: 634, ncost: 1540122.1209519957
Run: 6, iteration: 4/20, moves: 96, ncost: 1539576.6221213548
Run: 6, iteration: 5/20, moves: 11, ncost: 1539569.3259381542
Run: 6, iteration: 6/20, moves: 0, ncost: 1539569.3259381542
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 7, iteration: 1/20, moves: 2671, ncost: 1675566.0681372103
Run: 7, iteration: 2/20, moves: 1769, ncost: 1580447.8998761047
Run: 7, iteration: 3/20, moves: 1277, ncost: 1543393.8565527913
Run: 7, iteration: 4/20, moves: 275, ncost: 1539625.6744574825
Run: 7, iteration: 5/20, moves: 25, ncost: 1539569.4896820632
Run: 7, iteration: 6/20, moves: 1, ncost: 1539569.4317001232
Run: 7, iteration: 7/20, moves: 0, ncost: 1539569.4317001232
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 8, iteration: 1/20, moves: 3110, ncost: 1622543.7908887756
Run: 8, iteration: 2/20, moves: 1940, ncost: 1548212.6714456188
Run: 8, iteration: 3/20, moves: 542, ncost: 1539777.1065220442
Run: 8, iteration: 4/20, moves: 59, ncost: 1539591.5378910885
Run: 8, iteration: 5/20, moves: 14, ncost: 1539570.811111528
Run: 8, iteration: 6/20, moves: 5, ncost: 1539569.3259381542
Run: 8, iteration: 7/20, moves: 0, ncost: 1539569.3259381542
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 9, iteration: 1/20, moves: 4056, ncost: 1855342.234283102
Run: 9, iteration: 2/20, moves: 2125, ncost: 1709782.9445753312
Run: 9, iteration: 3/20, moves: 656, ncost: 1693400.576341686
Run: 9, iteration: 4/20, moves: 546, ncost: 1684109.1928503346
Run: 9, iteration: 5/20, moves: 411, ncost: 1679676.7154780398
Run: 9, iteration: 6/20, moves: 126, ncost: 1679153.160412582
Run: 9, iteration: 7/20, moves: 97, ncost: 1678604.8229056443
Run: 9, iteration: 8/20, moves: 354, ncost: 1672859.436926949
Run: 9, iteration: 9/20, moves: 340, ncost: 1668193.058706097
Run: 9, iteration: 10/20, moves: 105, ncost: 1667707.397170218
Run: 9, iteration: 11/20, moves: 25, ncost: 1667685.9769660353
Run: 9, iteration: 12/20, moves: 11, ncost: 1667644.8612211242
Run: 9, iteration: 13/20, moves: 29, ncost: 1667338.6539913032
Run: 9, iteration: 14/20, moves: 12, ncost: 1667318.463251469
Run: 9, iteration: 15/20, moves: 1, ncost: 1667318.4209270775
Run: 9, iteration: 16/20, moves: 0, ncost: 1667318.4209270775
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 10, iteration: 1/20, moves: 2951, ncost: 1591863.148218282
Run: 10, iteration: 2/20, moves: 1421, ncost: 1554241.0215132083
Run: 10, iteration: 3/20, moves: 382, ncost: 1550923.507064586
Run: 10, iteration: 4/20, moves: 49, ncost: 1550874.7637688234
Run: 10, iteration: 5/20, moves: 18, ncost: 1550866.05655826
Run: 10, iteration: 6/20, moves: 8, ncost: 1550863.9737742052
Run: 10, iteration: 7/20, moves: 2, ncost: 1550863.64056594
Run: 10, iteration: 8/20, moves: 0, ncost: 1550863.64056594
Best run was number 3
In [42]:
print(kproto.cluster_centroids_)
[['18.596765498652292' '27.628032345013477' '21.19191374663073'
  'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male'
  'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102'
  'minimal' 'slightly dissatisfied' 'minimal']
 ['13.916678805535325' '6.450400582665695' '21.05972323379461'
  'Not difficult at all' 'League of Legends' 'PC' '0' 'fun' 'Male'
  'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102'
  'minimal' 'slightly satisfied' 'minimal']
 ['31.351577591757888' '8.328396651641983' '20.56342562781713'
  'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male'
  'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102'
  'minimal' 'slightly dissatisfied' 'minimal']
 ['58.56467315716272' '14.584144645340752' '20.289290681502088'
  'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male'
  'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102'
  'minimal' 'dissatisfied' 'minimal']]

But clustering didn't seem to work. No matter how many different combinations I tried, or how many clusters I made, I was not able to divide data such that all tests could have concerning cut-off value.

So, I will use my own target variable derived from those tests.

Creating target variable.¶

In [43]:
df["GAD"] = df["GAD"].map(lambda x: 1 if any([y in x for y in ["mild", "moderate", "severe"]]) else 0)
df["SWL"] = df["SWL"].map(lambda x: 1 if any([y in x for y in ["extremely dissatisfied", "dissatified"]]) else 0)
df["SPIN"] = df["SPIN"].map(lambda x: 1 if any([y in x for y in ["moderate", "severe", "extreme"]]) else 0)

My model will try to classify gamer experiences any amount of stress or anxiety using cut-off values by tests.

In [44]:
df['target'] = np.where((df['GAD']+df['SWL']+df['SPIN'] >= 1), 1, 0)
df = df.drop(columns= ["GAD", "SWL", "SPIN"])
df.head()
Out[44]:
GADE Game Platform Hours earnings whyplay streams Gender Age Work Degree Reference Playstyle Residence_ISO3 target
0 Not difficult at all Skyrim Console (PS, Xbox, ...) 15.0 0 fun 0.0 Male 25 Unemployed / between jobs Bachelor Reddit SinglePlayer 102 0
1 Somewhat difficult Other PC 8.0 0 fun 2.0 Male 41 Unemployed / between jobs Bachelor Reddit Multiplayer 102 1
2 Not difficult at all Other PC 0.0 0 fun 0.0 Female 32 Employed Bachelor Reddit SinglePlayer 23 1
3 Not difficult at all Other PC 20.0 0 improve 5.0 Male 28 Employed Bachelor Reddit Multiplayer 102 0
4 Very difficult Other Console (PS, Xbox, ...) 20.0 0 fun 1.0 Male 19 Employed Other Reddit Multiplayer 56 1

Generating dummies¶

In [45]:
df = pd.get_dummies(df, drop_first=False)
In [46]:
df.head()
Out[46]:
Hours earnings streams Age Residence_ISO3 target GADE_Extremely difficult GADE_Not difficult at all GADE_Somewhat difficult GADE_Very difficult ... Degree_Master Degree_Other Reference_CrowdFlower Reference_Other Reference_Reddit Reference_TeamLiquid.net Playstyle_Both Playstyle_Multiplayer Playstyle_Other Playstyle_SinglePlayer
0 15.0 0 0.0 25 102 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 1
1 8.0 0 2.0 41 102 1 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
2 0.0 0 0.0 32 23 1 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 1
3 20.0 0 5.0 28 102 0 0 1 0 0 ... 0 0 0 0 1 0 0 1 0 0
4 20.0 0 1.0 19 56 1 0 0 0 1 ... 0 1 0 0 1 0 0 1 0 0

5 rows × 55 columns

Train-Test Split¶

In [47]:
y = df[['target']]
X = df.drop(['target'], axis=1)
In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.10)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.10)
In [49]:
X_train.head()
Out[49]:
Hours earnings streams Age Residence_ISO3 GADE_Extremely difficult GADE_Not difficult at all GADE_Somewhat difficult GADE_Very difficult Game_Counter Strike ... Degree_Master Degree_Other Reference_CrowdFlower Reference_Other Reference_Reddit Reference_TeamLiquid.net Playstyle_Both Playstyle_Multiplayer Playstyle_Other Playstyle_SinglePlayer
13173 10.0 0 4.0 18 23 0 1 0 0 0 ... 0 1 0 1 0 0 0 0 0 1
5933 24.0 0 20.0 26 102 0 1 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
7731 35.0 0 2.0 20 102 0 1 0 0 0 ... 0 1 0 0 1 0 0 1 0 0
5118 20.0 0 2.0 22 80 0 0 0 1 0 ... 0 1 0 0 1 0 0 1 0 0
8933 50.0 0 10.0 20 31 0 0 1 0 0 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 54 columns

In [50]:
y_train.head()
Out[50]:
target
13173 0
5933 0
7731 1
5118 1
8933 1

Standardization¶

In [51]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_val = ss.transform(X_val)
X_test = ss.transform(X_test)

Utility Function¶

In [52]:
def performance_metrics(y_val, pred_val):
    print('Recall: ', metrics.recall_score(y_val, pred_val))
    tn, fp, fn, tp = metrics.confusion_matrix(y_val, pred_val).ravel()
    print('Specificity: ', (tn / (tn + fp)))
    print('Precision: ', metrics.precision_score(y_val, pred_val))
    print('F1-score: ', metrics.f1_score(y_val, pred_val))
    print('Balanced Accuracy: ', metrics.balanced_accuracy_score(y_val, pred_val))

Classification¶

Random Forest¶

In [53]:
model = RandomForestClassifier()
In [54]:
model.fit(X_train, y_train)
Out[54]:
RandomForestClassifier()
In [55]:
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.9925020827547902
In [56]:
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7278797996661102

Hyperparameter Tuning¶

Defining hyperparameters¶

In [57]:
min_samples_splits = [10, 50, 100, 200]
max_depths = [2,5,10,15]
n_estimators = [100, 500]
params = {
    "min_samples_split": min_samples_splits,
    "max_depth": max_depths, 
    "n_estimators": n_estimators
    }

Grid Search¶

In [58]:
grid_search = GridSearchCV(estimator=model, param_grid=params, scoring="f1", n_jobs=-1)
In [59]:
grid_search.fit(X_train, y_train)
Out[59]:
GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [2, 5, 10, 15],
                         'min_samples_split': [10, 50, 100, 200],
                         'n_estimators': [100, 500]},
             scoring='f1')

Best HyperParamater values¶

In [60]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train Accuracy :", grid_search.best_score_)
Best hyperparameter values:  {'max_depth': 10, 'min_samples_split': 200, 'n_estimators': 500}
Train Accuracy : 0.7500868186356877

Validation Score¶

In [61]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7549342105263157

Confusion Matrix for validation set¶

In [62]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[62]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e67796d0>

Performance Matrix for validation set¶

In [63]:
performance_metrics(y_val, pred_val)
Recall:  0.7637271214642263
Specificity:  0.7045454545454546
Precision:  0.7463414634146341
F1-score:  0.7549342105263157
Balanced Accuracy:  0.7341362880048404

Logistic Regression¶

In [64]:
estimator = LogisticRegression()
In [65]:
estimator.fit(X_train, y_train)
Out[65]:
LogisticRegression()
In [66]:
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7499764750164675
In [67]:
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7535269709543568

Hyperparameter Tuning¶

Defining hyperparameters¶

In [68]:
parameters = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'penalty': ['l2'],
    'C': [100, 10, 1.0, 0.1, 0.01]  
}

Grid Search¶

In [69]:
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
In [70]:
grid_search.fit(X_train,y_train)
Out[70]:
GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [100, 10, 1.0, 0.1, 0.01], 'penalty': ['l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear']},
             scoring='f1')

Best HyperParamater values¶

In [71]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Training score :", grid_search.best_score_)
Best hyperparameter values:  {'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
Training score : 0.7489636142737558

Validation Score¶

In [72]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7545605306799336

Confusion Matrix for validation set¶

In [73]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[73]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e611f1c0>

Performance Matrix for validation set¶

In [74]:
performance_metrics(y_val, pred_val)
Recall:  0.757071547420965
Specificity:  0.7159090909090909
Precision:  0.7520661157024794
F1-score:  0.7545605306799336
Balanced Accuracy:  0.736490319165028

K Nearest Neighbors¶

In [75]:
estimator = KNeighborsClassifier()
In [76]:
estimator.fit(X_train, y_train)
Out[76]:
KNeighborsClassifier()
In [77]:
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7852215817821033
In [78]:
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.699581589958159

Hyperparameter Tuning¶

Defining hyperparameters¶

In [79]:
parameters = {
    'n_neighbors': range(1, 21, 2),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

Grid Search¶

In [80]:
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
In [81]:
grid_search.fit(X_train,y_train)
Out[81]:
GridSearchCV(cv=10, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
                         'n_neighbors': range(1, 21, 2),
                         'weights': ['uniform', 'distance']},
             scoring='f1')

Best HyperParamater values¶

In [82]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Training score :", grid_search.best_score_)
Best hyperparameter values:  {'metric': 'manhattan', 'n_neighbors': 19, 'weights': 'uniform'}
Training score : 0.7197998110096405

Validation Score¶

In [83]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7362270450751252

Confusion Matrix for validation set¶

In [84]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[84]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e4da98e0>

Performance Matrix for validation set¶

In [85]:
performance_metrics(y_val, pred_val)
Recall:  0.7337770382695508
Specificity:  0.7045454545454546
Precision:  0.7386934673366834
F1-score:  0.7362270450751252
Balanced Accuracy:  0.7191612464075027

Adaboost¶

In [86]:
model = AdaBoostClassifier()
In [87]:
model.fit(X_train, y_train)
Out[87]:
AdaBoostClassifier()
In [88]:
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7503762227238524
In [89]:
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7514546965918537

Hyperparameter Tuning¶

Defining hyperparameters¶

In [90]:
parameters = {
    'n_estimators': range(10, 200, 5),
    'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10, 100]
}

Grid Search¶

In [91]:
grid_search = GridSearchCV(model, parameters, cv=10, n_jobs=-1, scoring='f1')
In [92]:
grid_search.fit(X_train,y_train)
Out[92]:
GridSearchCV(cv=10, estimator=AdaBoostClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10,
                                           100],
                         'n_estimators': range(10, 200, 5)},
             scoring='f1')

Best HyperParamater values¶

In [93]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values:  {'learning_rate': 1, 'n_estimators': 45}
Train score : 0.7491643493073873

Validation Score¶

In [94]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7518672199170123

Confusion Matrix for validation set¶

In [95]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[95]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e621dbb0>

Performance Matrix for validation set¶

In [96]:
performance_metrics(y_val, pred_val)
Recall:  0.7537437603993344
Specificity:  0.7140151515151515
Precision:  0.75
F1-score:  0.7518672199170123
Balanced Accuracy:  0.733879455957243

SVM Classifier¶

In [97]:
estimator = SVC(kernel ='rbf')
In [98]:
estimator.fit(X_train, y_train)
Out[98]:
SVC()
In [99]:
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7579125847776941
In [100]:
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7529215358931554

Hyperparameter Tuning¶

Defining hyperparameters¶

In [101]:
parameters = {
    'C': [1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.1, 1]  
}

Grid Search¶

In [102]:
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
In [103]:
grid_search.fit(X_train,y_train)
Out[103]:
GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [1, 10, 100, 1000],
                         'gamma': [0.001, 0.01, 0.1, 1]},
             scoring='f1')

Best HyperParamater values¶

In [104]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Training score :", grid_search.best_score_)
Best hyperparameter values:  {'C': 10, 'gamma': 0.001}
Training score : 0.7473337588537545

Validation Score¶

In [105]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363

Confusion Matrix for validation set¶

In [106]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[106]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e81cbdc0>

Performance Matrix for validation set¶

In [107]:
performance_metrics(y_val, pred_val)
Recall:  0.7554076539101497
Specificity:  0.7272727272727273
Precision:  0.7591973244147158
F1-score:  0.7572977481234363
Balanced Accuracy:  0.7413401905914385

Gradient Boosting¶

In [108]:
model = GradientBoostingClassifier()
In [109]:
model.fit(X_train, y_train)
Out[109]:
GradientBoostingClassifier()
In [110]:
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7573742859818335
In [111]:
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7524752475247526

Hyperparameter Tuning¶

Defining hyperparameters¶

In [112]:
parameters = {
    'n_estimators': range(1, 15),
    'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10, 100]
}

Grid Search¶

In [113]:
grid_search = GridSearchCV(model, parameters, cv=10, n_jobs=-1, scoring='f1')
In [114]:
grid_search.fit(X_train,y_train)
Out[114]:
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10,
                                           100],
                         'n_estimators': range(1, 15)},
             scoring='f1')

Best HyperParamater values¶

In [115]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values:  {'learning_rate': 0.25, 'n_estimators': 14}
Train score : 0.7490153737547018

Validation Score¶

In [116]:
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.757071547420965

Confusion Matrix for validation set¶

In [117]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[117]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1ee3fa910>

Performance Matrix for validation set¶

In [118]:
performance_metrics(y_val, pred_val)
Recall:  0.757071547420965
Specificity:  0.7234848484848485
Precision:  0.757071547420965
F1-score:  0.757071547420965
Balanced Accuracy:  0.7402781979529067

Next I tried PCA to check if I can get better results¶

Feature Importance using PCA¶

In [119]:
pca = PCA(n_components=None)

pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_val_pca = pca.transform(X_val)
X_test_pca = pca.transform(X_test)
In [120]:
plt.rcParams['figure.figsize'] = [15, 5]
In [121]:
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum(), '-o');
plt.xticks(ticks= range(X_train_pca.shape[1]), labels=[i+1 for i in range(X_train_pca.shape[1])])
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()
[0.05667402 0.10590517 0.14508473 0.18283884 0.21769369 0.25088518
 0.28201313 0.31096007 0.33924632 0.36490784 0.38797049 0.41053378
 0.43226598 0.45377246 0.47472923 0.49514037 0.51533088 0.53523005
 0.55478265 0.57415596 0.59334971 0.61244475 0.63138635 0.65028774
 0.66909677 0.68772755 0.70630321 0.72470787 0.7430054  0.76125049
 0.77933777 0.79724634 0.81511447 0.83266108 0.85001506 0.86700541
 0.88381721 0.90045208 0.91693923 0.9333307  0.94915756 0.96445279
 0.9790912  0.99190021 1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.        ]
In [122]:
X_train_pca2 = X_train_pca[:, 0:39]
X_val_pca2 = X_val_pca[:, 0:39]
X_test_pca2 = X_test_pca[:, 0:39]

Using any model is fine here, as all of them gives around the same result. But I'm using SVM Classifier because it gives mariginally higher f1-score.

SVM Classifer using PCA¶

In [124]:
estimator = SVC(kernel ='rbf')
In [125]:
estimator.fit(X_train_pca2,y_train)
Out[125]:
SVC()
In [126]:
pred_train = estimator.predict(X_train_pca2)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7572047466566207
In [127]:
pred_val = estimator.predict(X_val_pca2)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7529215358931554

Hyperparameter Tuning¶

Defining hyperparameters¶

In [128]:
parameters = {
    'C': [1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.1, 1]  
}

Grid Search¶

In [129]:
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
In [130]:
grid_search.fit(X_train_pca2,y_train)
Out[130]:
GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [1, 10, 100, 1000],
                         'gamma': [0.001, 0.01, 0.1, 1]},
             scoring='f1')

Best HyperParamater values¶

In [131]:
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values:  {'C': 10, 'gamma': 0.001}
Train score : 0.7472882769622254

Validation Score¶

In [132]:
pred_val = grid_search.predict(X_val_pca2)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363

Confusion Matrix for validation set¶

In [133]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
Out[133]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e62178e0>

Performance Matrix for validation set¶

In [134]:
performance_metrics(y_val, pred_val)
Recall:  0.7554076539101497
Specificity:  0.7272727272727273
Precision:  0.7591973244147158
F1-score:  0.7572977481234363
Balanced Accuracy:  0.7413401905914385

We can see that both methods give same result.¶

Next, I will use SVM Classifer with its best hyperparameters to check its test score. Using either PCA or standardized data is fine here. I will stick with standardized data.

SVM Classifier with C = 10, gamma = 0.001 and kernel = 'rbf' using Standardized Data¶

In [135]:
svm = SVC(C = 10, gamma = 0.001, kernel = 'rbf')
In [136]:
svm.fit(X_train, y_train)
Out[136]:
SVC(C=10, gamma=0.001)

Validation Score¶

In [137]:
pred_val = svm.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363

Test Score¶

In [138]:
pred_test = svm.predict(X_test)
print("Test score:", metrics.f1_score(y_test, pred_test))
Test score: 0.7378787878787879

Confusion Matrix for test set¶

In [139]:
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_test, pred_test)).plot()
Out[139]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1ee3a6df0>

Performance Matrix for validation set¶

In [140]:
performance_metrics(y_test, pred_test)
Recall:  0.7279521674140508
Specificity:  0.7201365187713311
Precision:  0.7480798771121352
F1-score:  0.7378787878787879
Balanced Accuracy:  0.724044343092691

We can see that the model is not overfitting. Though the reason behind low accuracy is unbalanced dataset. We can try intense hyperparameter tuning here, but it will only lead to overfitting. Maybe we can try cleaning the data again with different approaches. In any case more data is required for further analysis.

I tried different values for target, but my current combinations gave me the best F1-score. It checks if any of the test(GAD, SWL, SPIN) has concerning value. In one combination, I even reached 82% accuracy. But the specificity was very low and the model couldn't be trusted, that's why after after many iterations I used this combination.

From this project I learned how important preprocessing is. Even though I spent days coming up with current solution, I still believe I haven't spent enough time with the data. If I had more time I would have definitely tried using text clustering too. In any case, this project has improved my analysis skill and I'm glad I used such complex dataset for my project.

Thank you¶